Vers une modélisation statistique multi-niveau du langage, application aux langues peu dotées. (Toward multi-level statistical language modeling for under-resourced languages)
Abstract
This PhD thesis addresses the problems encountered when developing automatic speech recognition for under-resourced languages whose writing systems do not explicitly separate words. For these languages, the text corpus must be automatically segmented into words before n-gram language modeling can be applied. While the lack of text data already degrades language model performance, the errors introduced by automatic segmentation can make these data even less usable. To deal with these problems, our research focuses primarily on language modeling, and in particular on the choice of the lexical and sub-lexical units used by the recognition systems. We investigate the use of multiple units in a speech recognition system. At the language model level, the models are trained with hybrid vocabularies built from both lexical and sub-lexical units. At the system output level, we try to combine the outputs of several recognition systems, each based on a different modeling unit: lexical or sub-lexical. To better exploit the textual data by taking different views of the same data, we propose a method that performs multiple segmentations of the training corpus instead of a conventional single segmentation. This method, based on finite state machines, generates all possible segmentations of a character sequence, from which n-grams are extracted to train the language model. It finds n-grams that the single-segmentation method misses and adds them to the language model. We validate these multiple-unit modeling approaches in recognition systems for a group of languages: Khmer, Vietnamese, Thai and Laotian.
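The idea of multiple segmentation can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract describes a finite-state-machine method, whereas this sketch substitutes a memoized recursive enumeration over a word list, which produces the same set of segmentations on small inputs. The function names `all_segmentations` and `ngram_counts`, the toy vocabulary, and the toy input string are all hypothetical, and the vocabulary is assumed to be non-empty.

```python
from collections import Counter
from functools import lru_cache


def all_segmentations(text, vocab, max_len=None):
    """Return every way to split `text` into words found in `vocab`.

    Assumption: `vocab` is a non-empty set of strings. This is a plain
    recursive enumeration, not the finite-state method of the thesis.
    """
    max_len = max_len or max(len(w) for w in vocab)

    @lru_cache(maxsize=None)
    def from_pos(i):
        # All segmentations of text[i:], each as a tuple of words.
        if i == len(text):
            return ((),)
        results = []
        for j in range(i + 1, min(len(text), i + max_len) + 1):
            word = text[i:j]
            if word in vocab:
                for rest in from_pos(j):
                    results.append((word,) + rest)
        return tuple(results)

    return [list(seg) for seg in from_pos(0)]


def ngram_counts(segmentations, n):
    """Count the n-grams observed across all given segmentations."""
    counts = Counter()
    for seg in segmentations:
        for k in range(len(seg) - n + 1):
            counts[tuple(seg[k:k + n])] += 1
    return counts


# Toy example (hypothetical vocabulary, unsegmented input "abc").
vocab = {"a", "b", "ab", "c"}
segs = all_segmentations("abc", vocab)
bigrams = ngram_counts(segs, 2)
```

On this toy input, `segs` contains both `["a", "b", "c"]` and `["ab", "c"]`. A single segmentation would keep only one of them, so either the bigram `("ab", "c")` or the bigrams `("a", "b")` and `("b", "c")` would never reach the language model. Note that the number of segmentations can grow exponentially with the input length, which is one reason a compact representation such as a finite state machine is attractive in practice.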
Similar resources
A State of the Art of Word Sense Induction: A Way Towards Word Sense Disambiguation for Under-Resourced Languages
Word Sense Disambiguation (WSD), the process of automatically identifying the meaning of a polysemous word in a sentence, is a fundamental task in Natural Language Processing (NLP). Progress in this approach to WSD opens up many promising developments in the field of NLP and its applications. Indeed, ...
G-OWL : Vers un langage de modélisation graphique, polymorphique et typé pour la construction d'une ontologie dans la notation OWL
Abstract: The Web Ontology Language (OWL), standardized by the W3C, aims to provide an ontology design language for the semantic web. Engineering an ontology is a complex activity that requires a skill not readily accessible to content experts. By contrast, for modeling business content, semi-formal graphical modeling is a technique often used to...
A State of the Art of Word Sense Induction: A Way Towards Word Sense Disambiguation for Under-Resourced Languages (État de l'art de l'induction de sens: une voie vers la désambiguïsation lexicale pour les langues peu dotées) [in French]
Analyse des performances de modèles de langage sub-lexicale pour des langues peu-dotées à morphologie riche (Performance analysis of sub-word language modeling for under-resourced languages with rich morphology: case study on Swahili and Amharic) [in French]
This paper investigates the impact on ASR performance of sub-word units for two under-resourced African languages with rich morphology (Amharic and Swahili). Two sub-word units are considered: syllable and morpheme, the latter being obtained in a supervised or...
Transformation des contraintes d'intégrité - Des modèles conceptuels vers le relationnel
ABSTRACT. In a conceptual model, integrity constraints are an integral part whose definition is necessary to best express the semantics of the perceived reality. However, even though these constraints are expressed at the conceptual level, they are very often ignored in the transition to the logical level. In practice, most CASE modeling tools do not support...
Publication date: 2010